merge imports #1394

obfusk · 2023-06-26T14:42:12Z

Replaces #1389.

This renumbers IDs -- both from the CSV and in the image file names -- that are present in the imported data, to make sure they do not conflict with the existing data.

It also removes the newly added image files if the import fails, which is now safe since there are no conflicts with existing IDs.

Only the Catima importer has to deal with IDs being present in the imported data; all other importers create new IDs, so they are not affected.

I'm not really handling duplicate group names, but errors from DBHelper.insertGroup() are ignored and I think it makes sense to just effectively merge groups with identical names anyway; we do need to deal with that when we have proper group IDs.

obfusk · 2023-06-28T12:31:30Z

After this change, the only errors ignored during import should be duplicate groups (which is fine), unless the data being imported does not conform to the expected format somehow, which should not happen under any normal circumstances.

But a lot of the code still doesn't actually check for errors.
That seems like something we might still want to address, in a separate PR.
Let me know how you'd like to proceed with that :)

obfusk · 2023-06-28T12:49:10Z

unless the data being imported does not conform to the expected format somehow

It looks like the parser is handling that already. So I can't guarantee any overlooked edge cases, but I'm not seeing any potential errors being overlooked now, except for duplicate groups and group mappings, which are perfectly fine to ignore, or something being wrong with the database itself. But I don't expect the parser to produce invalid/conflicting data, and I expect valid data to not result in errors.

I still think error checking is a good idea anyway. But it doesn't look like a high priority after this.

obfusk · 2023-06-28T12:53:46Z

There could be an NPE from invalid data. But that should be caught by the importer and cause the import to fail.

TheLastProject · 2023-06-29T12:42:19Z

I'm wondering, wouldn't it be cleaner to change how the id database value for cards is generated open card creation (like: unix timestamp instead of incrementing by 1 for each new card)?

As I understand your change this doesn't "merge" imports, it just assumes that anything in the import doesn't exist yet so it switches the brokenness from silently dropping cards to duplicating cards which I agree is an improvement but still not the desirable status.

TheLastProject · 2023-06-29T12:44:28Z

See also #513 and especially #512 (comment) for what feels like a better flow. And CatimaLoyalty/Docs#4 (comment).

obfusk · 2023-06-29T13:37:06Z

I'm wondering, wouldn't it be cleaner to change how the id database value for cards is generated open card creation (like: unix timestamp instead of incrementing by 1 for each new card)?

We would still need to deal with existing exported data using old IDs though. So that would add complexity. And might still not produce the desired result if you e.g. add the same card to two devices and then want to import one into the other. This would pretty much only help avoid duplicating the exact same card added on the same device. And then you could still have two different versions that were modified independently on different devices or when merging exports from different times.

As I understand your change this doesn't "merge" imports, it just assumes that anything in the import doesn't exist yet so it switches the brokenness from silently dropping cards to duplicating cards which I agree is an improvement but still not the desirable status.

Fair enough. The point here is to make sure nothing gets overwritten/broken. Having to delete a few duplicates manually is much better than the current situation. I was focused on that, and not so much on dealing with duplicates.

Adding a conflict dialog like Nextcloud seems like quite a bit of work, and since duplicate file/card names are not an issue here, I'm not sure it makes sense to do the same. If you want to go that route I'd suggest merging this for now and making a separate PR for the additional work.

It would be relatively easy to add something like this though:

prefer existing data / prefer imported data / don't check for duplicates (i.e. what it does now)
choose which fields must be equal for the card to be considered "the same", defaulting to cardid + name maybe

I can add that to this PR if you'd like.

obfusk · 2023-06-29T13:39:03Z

Also note that only catima exports contain IDs. If we want "intelligent merging", it seems good to apply that to all other import formats as well, not just catima. For consistency.

obfusk · 2023-06-29T13:50:54Z

prefer existing data / prefer imported data / don't check for conflicts

A fourth option would be to prefer latest lastused, but that seems a bit brittle.

obfusk · 2023-06-29T13:54:30Z

I can add that to this PR if you'd like.

Or we could merge this as-is and I'll make a follow-up PR for that. Let me know what you prefer.

obfusk · 2023-06-30T11:48:04Z

actually, I think this would not be too hard after all:

ask / prefer existing data / prefer imported data / don't check for duplicates (i.e. what it does now)
choose which fields must be equal for the card to be considered "the same", defaulting to cardid + name

I think I'll implement that in catimerge.

what do you think?

TheLastProject · 2023-07-01T12:33:30Z

The main issue is that the current import code needs a complete rewrite anyway (see #513). Because it'll require moving the whole thing into a background thread and requires spawning a notification, using a notification on conflicts and offering options seemed easier but I guess it probably is still challenging.

My main concern is complicating the current importing code and making moving away from the current deprecated and crashy method more work.

My secondary concern is switching to a new method that won't work well for multi-device sync and I think randomly increasing the IDs if the ID is already in use will be one of those things where it'll severely complicate multi-device sync and risk the possibility of getting stuck in loops where IDs keep getting incremented each sync because they keep conflicting.

Aside from that, messing with the IDs seems very hacky, like the kind of think that will bite us in the future.

You do raise a good point with that other importers don't share Catima's ID system and the only combination of fields we can reasonably assume to always exist is name, type and value and perhaps merging based on those values makes most sense. It would be nice to detect duplicates based on those fields but without any way to notify users a duplicate was found that is currently useless.

So, in my mind, the cleanest solution is:

(Temporary hack) Drop all data before import
Migrate the importer to a background thread so it won't use deprecated code and won't crash when users navigate away or try to import from for example Nextcloud
Write a good merging solution and notify users on conflicts to ask them what to do (at this point we can revert step 1)

I don't really think writing merging code while the importing code is in such a broken state and we can't reliably notify users of conflicts makes a lot of sense as it'll just translate into more code to move to the new system and more things to test before the new system can be considered stable.

obfusk · 2023-07-01T13:53:44Z

severely complicate multi-device sync

That wasn't something I was considering in this context. But good point. We should address that.

(Temporary hack) Drop all data before import

#1389 does that. So feel free to merge that as-is (or request changes). Unfortunately, if the import fails the DB will be unchanged but images may have been overwritten. Preventing that would be quite hacky with the current code, even more so than this PR which at least should keep both old and new data completely intact.

Migrate the importer to a background thread

That does seem like a good idea. Sadly, I have no experience with that so it'll take more time to figure out. But I'd like to help with that. Any pointers on how to get started with that?

Write a good merging solution and notify users on conflicts to ask them what to do

My proposal above is an attempt to formulate such a merging strategy. It could almost certainly use some tweaking but the basic concept seems sound to me personally, though I could easily have missed a use case I don't have myself that it doesn't cover.

Feedback on that idea so we can figure out the best merge strategy before implementing it would be appreciated. I would then like to implement that in catimerge so it can be evaluated before modifying Catima itself.

And yes, if we're going to implement a UI for conflict resolution, we should probably do it as part of the rewrite and not based on the current implementation.

TheLastProject · 2023-07-13T17:55:01Z

Unfortunately, if the import fails the DB will be unchanged but images may have been overwritten. Preventing that would be quite hacky with the current code, even more so than this PR which at least should keep both old and new data completely intact.

I may be missing something but I'd expect this to be fairly simple to solve?

Get a list of all IDs
Start a database transaction
Delete all cards
Try to import all cards (on fail, exit)
Delete the images for every single card ID retrieved in 1
Import all images from the import

But maybe it's better to just do nothing until we have a good fix because any workaround we do not we'll just have to revert again in the future.

Migrate the importer to a background thread

That does seem like a good idea. Sadly, I have no experience with that so it'll take more time to figure out. But I'd like to help with that. Any pointers on how to get started with that?

At #622 (comment) I talked about using a Service but I might be wrong there too. I don't know what the recommended Android method is for long-time background processing (maybe WorkManager?)

My proposal above is an attempt to formulate such a merging strategy. It could almost certainly use some tweaking but the basic concept seems sound to me personally, though I could easily have missed a use case I don't have myself that it doesn't cover.

I see one big issue with this: asking the user what they want. You can't use that for sync constantly, you'd have to bother users every sync. I'm thinking we should just change the ID to be the current UNIX timestamp on creation and consider any difference in field a conflict and offer the option to using 1, 2 or keep both (with keeping both we can insert the new card with the current timestamp). That way, the only conflicts anyone could possibly get would be multiple users saving a card on the exact same second which seems unlikely.

obfusk · 2023-07-13T20:18:42Z

I may be missing something but I'd expect this to be fairly simple to solve?

In theory, yes. But currently, parsing the data and updating the database is interleaved with saving the new images. So existing images may have been overwritten by the time the database transaction is aborted.

We could e.g. save new images with a tmp_ prefix and rename them after the transaction is committed (or delete them when it's aborted), that would be simpler than modifying the current control flow (which doesn't seem worth doing given that we plan to rewrite it all). I can add that to #1389 if you want.

But to me this PR seemed the better interim solution as it guarantees that no data is lost or overwritten. Though of course it may be duplicated, which is clearly a concern for you. Just let me know how you want to proceed :)

obfusk · 2023-07-13T20:53:59Z

I see one big issue with this: asking the user what they want. You can't use that for sync constantly, you'd have to bother users every sync. I'm thinking we should just change the ID to be the current UNIX timestamp on creation [...] That way, the only conflicts anyone could possibly get would be multiple users saving a card on the exact same second which seems unlikely.

I agree it makes sense to use timestamps to avoid duplicate IDs. That does indeed make syncing and importing easier since we rarely need to modify the ID. It does not however solve the problem with existing IDs (and imports from formats without IDs). As I see it, that leaves us with:

1. Same ID, 100% identical data:

Trivial :)

Though what about e.g. two identical cards with different images?

2. Same ID, different data (e.g. modified independently, could be a completely different name or barcode or just a modified note or colour):

consider any difference in field a conflict and offer the option to using 1, 2 or keep both (with keeping both we can insert the new card with the current timestamp).

Agreed. But what will we do next sync?

Also: last used timestamp will likely differ, so maybe we should ignore that one field (not sure about header colour and star status).

3. Different ID, 100% identical data (e.g. adding the exact same card on two devices independently):

We need to choose which ID to use, but that should not be too hard.

Also: last used timestamp will likely differ, leading us to 4 unless we ignore that.

4. Different ID, not 100% identical data:

This is the hard one. I think if e.g. the name and barcode (I suggested letting the user choose which fields to use for that, but that seems like a good default) are identical we should consider this a conflict as well and handle it the same way as 2.

Also: this could happen to more than one pair of cards at a time. And what will we do next sync? If we keep both, there should not be a conflict as we have an identical card with the same ID.

Also: how is sync intended to work? Unidirectional? Bidirectional? Will it remember how conflicts were handled before?

obfusk added 2 commits June 26, 2023 16:57

importer: add and log newImageFiles + maxLoyaltyCardId

47a0db2

catima importer: renumber IDs to merge properly

6e4a944

obfusk force-pushed the merge-imports branch from a7d3f2a to 6e4a944 Compare June 26, 2023 15:00

obfusk marked this pull request as ready for review June 26, 2023 15:05

obfusk mentioned this pull request Jun 26, 2023

Import issues #1387

Closed

obfusk mentioned this pull request Jun 28, 2023

DB: migrate to unique IDs for groups #1333

Draft

12 tasks

obfusk marked this pull request as draft June 30, 2023 13:13

obfusk mentioned this pull request Jul 17, 2023

refactor importer #1430

Open

9 tasks

Korb mentioned this pull request Mar 27, 2024

Simultaneous usage on two or more devices: thoughts #20

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

merge imports #1394

merge imports #1394

obfusk commented Jun 26, 2023 •

edited

Loading

obfusk commented Jun 28, 2023

obfusk commented Jun 28, 2023

obfusk commented Jun 28, 2023

TheLastProject commented Jun 29, 2023

TheLastProject commented Jun 29, 2023 •

edited

Loading

obfusk commented Jun 29, 2023 •

edited

Loading

obfusk commented Jun 29, 2023

obfusk commented Jun 29, 2023

obfusk commented Jun 29, 2023

obfusk commented Jun 30, 2023

TheLastProject commented Jul 1, 2023

obfusk commented Jul 1, 2023

TheLastProject commented Jul 13, 2023

obfusk commented Jul 13, 2023

obfusk commented Jul 13, 2023

merge imports #1394

Are you sure you want to change the base?

merge imports #1394

Conversation

obfusk commented Jun 26, 2023 • edited Loading

obfusk commented Jun 28, 2023

obfusk commented Jun 28, 2023

obfusk commented Jun 28, 2023

TheLastProject commented Jun 29, 2023

TheLastProject commented Jun 29, 2023 • edited Loading

obfusk commented Jun 29, 2023 • edited Loading

obfusk commented Jun 29, 2023

obfusk commented Jun 29, 2023

obfusk commented Jun 29, 2023

obfusk commented Jun 30, 2023

TheLastProject commented Jul 1, 2023

obfusk commented Jul 1, 2023

TheLastProject commented Jul 13, 2023

obfusk commented Jul 13, 2023

obfusk commented Jul 13, 2023

1. Same ID, 100% identical data:

2. Same ID, different data (e.g. modified independently, could be a completely different name or barcode or just a modified note or colour):

3. Different ID, 100% identical data (e.g. adding the exact same card on two devices independently):

4. Different ID, not 100% identical data:

obfusk commented Jun 26, 2023 •

edited

Loading

TheLastProject commented Jun 29, 2023 •

edited

Loading

obfusk commented Jun 29, 2023 •

edited

Loading